Preface

Data Source

Exploratory Data Analysis

Observations

dim(simpsons)
## [1] 725  12
names(simpsons)
##  [1] "id"                     "title"                  "description"           
##  [4] "original_air_date"      "production_code"        "directed_by"           
##  [7] "written_by"             "season"                 "number_in_season"      
## [10] "number_in_series"       "us_viewers_in_millions" "imdb_rating"
summary(simpsons)
##        id         title           description        original_air_date   
##  Min.   :  0   Length:725         Length:725         Min.   :1989-12-17  
##  1st Qu.:181   Class :character   Class :character   1st Qu.:1997-10-26  
##  Median :362   Mode  :character   Mode  :character   Median :2005-11-27  
##  Mean   :362                                         Mean   :2006-01-04  
##  3rd Qu.:543                                         3rd Qu.:2014-03-16  
##  Max.   :724                                         Max.   :2022-05-22  
##  production_code    directed_by         written_by            season     
##  Length:725         Length:725         Length:725         Min.   : 1.00  
##  Class :character   Class :character   Class :character   1st Qu.: 9.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :17.00  
##                                                           Mean   :16.94  
##                                                           3rd Qu.:25.00  
##                                                           Max.   :33.00  
##  number_in_season  number_in_series us_viewers_in_millions  imdb_rating   
##  Min.   :   1.00   Min.   :     1   Length:725             Min.   :4.000  
##  1st Qu.:   6.00   1st Qu.:   182   Class :character       1st Qu.:6.600  
##  Median :  12.00   Median :   363   Mode  :character       Median :7.100  
##  Mean   :  15.98   Mean   :  3123                          Mean   :7.166  
##  3rd Qu.:  17.00   3rd Qu.:   544                          3rd Qu.:7.700  
##  Max.   :1920.00   Max.   :712713                          Max.   :9.300
sum(is.na(simpsons))
## [1] 0
head(simpsons) %>% rmarkdown::paged_table()
summary(simpsons$length_description)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   18.00   23.00   23.92   28.00   98.00
  • Most variables are viable in some way as predictor variables
    • e.g. use the description text for text analysis (frequency, sentiment, etc.) rather than ignoring it entirely
  • No variable for episode run time (can assume ~22 min average run time for a 30-minute TV slot)
  • Does air date include time slot? (Relevant for primetime vs. the Friday death slot for low-rated shows pending cancellation)
    • Upon further inspection, this shouldn’t be too relevant
  • Rating will be the most interesting variable to analyze / use as an independent variable
  • Variables renamed for ease of access:
    • imdb_rating to rating
    • us_viewers_in_millions to viewers
  • Adding two new variables:
    • length_description and age (time since original air date)
      • At the time of starting this analysis, Season 34 had already started airing, so any episode-age analysis will be based off that date, since new seasons start around this time of year
      • Shortest episode description is 5 words:
      simpsons$description[which(simpsons$length_description == 5)]
      ## [1] "Marge becomes a real-estate agent."      
      ## [2] "Marge accidentally gets breast implants."
      ## [3] "Fat Tony becomes Maggie's godfather."
      • 3 cases (187, 295, 714)
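A minimal sketch of how the two derived variables might be computed (assumed, not the exact code used in this analysis; the rows and dates below are toy placeholders, and the Season 34 reference date is an assumption):

```r
# Toy data: first description is from the dataset, second is invented;
# air dates are placeholders.
toy <- data.frame(
  description = c("Marge becomes a real-estate agent.",
                  "Homer and Bart go on a road trip across the country."),
  original_air_date = as.Date(c("1990-10-11", "2000-05-21"))
)

# length_description: count words as whitespace-separated tokens,
# so the shortest description above comes out to 5 words.
toy$length_description <- lengths(strsplit(toy$description, "\\s+"))

# age: days since original air date, relative to a fixed reference date
# (assumed to be around the Season 34 premiere).
reference_date <- as.Date("2022-09-25")
toy$age <- as.numeric(reference_date - toy$original_air_date)
```

Counting hyphenated words like "real-estate" as single tokens is what makes the shortest description 5 words rather than 6.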

Initial observations off of summary:

  • 725 x 12 table
  • 725 episodes (ids 0 to 724), 33 seasons
  • First air date: 1989-12-17
    • If memory serves, it’s the Christmas special with Santa’s Little Helper (the family dog)
  • Lowest rating is 4.0, but note that IMDB ratings are skewed/biased by nature

Potential errors in data based off summary:

  1. Error in number_in_series (Max 712713)

  2. Number episode in season (Max 1920)

  3. No blatant missing values, but some numeric values are stored as a different type

    • us_viewers_in_millions is char type

Refer to README for other preliminary questions

  • Misc:
    • Potential filter data by:
      • Treehouse of Horror
      • First and last episode of season
    • New external data points to consider:
      • Disney acquisition of Hulu / Fox and FX
        • Fox: March 19, 2019
        • “In March 2019, Disney acquired 21st Century Fox, giving it a 60% majority stake in Hulu”
        https://en.wikipedia.org/wiki/Hulu#Disney_majority_ownership
      • Sociopolitical/culture events (i.e. Presidental election episodes)
      • Simpsons movie release date: July 27, 2007

Data Error Discussion

A. Number In Series

rmarkdown::paged_table(head(arrange(simpsons, desc(number_in_series))))
#DT::datatable(head(arrange(simpsons, desc(number_in_series))))
subsetDF <- c("id", "title", "production_code", "directors", "writers", "number_in_season", "number_in_series")
simpsons_sub <- simpsons[,subsetDF]
  • Sorting by max number_in_series reveals cases where records are wrong
    • Number in series/id: 607, 679, 709
    • While scrolling through the table, other columns also stood out (subsetted by the following variables for further investigation)
subsetDF
## [1] "id"               "title"            "production_code"  "directors"       
## [5] "writers"          "number_in_season" "number_in_series"

B. Number in Season

#simpsons[which.max(simpsons$number_in_season),]
#head(arrange(simpsons, desc(number_in_season)))
rmarkdown::paged_table(head(arrange(simpsons_sub[c("id", "title","number_in_season")], desc(number_in_season))))
  • Two errors where the episode number in season is over 1,000
  • Interesting to note that both episodes are also wrong in number_in_series
    • Assuming the data is scraped off of Wikipedia, further inspection shows they are two-part episodes (also indicated by the episode titles)
      • Scraped such that the information for both parts is concatenated together
        • i.e. number_in_season is 1213
          • A two-part episode, of episodes 12 and 13 respectively
        • Note: episode 13 does not officially exist
        • If referencing by number_in_series, use id instead; number_in_season, however, is adjusted accordingly as two episodes
      • The same concatenation appears in the episode titles, while the multiple writers and different production codes show where the data differs between parts
  • Since it’s a two-part episode that premiered together, we can assume the same concurrent viewers for both parts on the same date
  • This raises the question of whether the episodes should be treated separately as different parts, or as one whole. For the sake of consistency, they will be two separate entities (since they have different writers and production codes)
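Under the assumption that a concatenated value like 1213 is two 2-digit episode numbers fused together during scraping, splitting it back apart is simple integer arithmetic (a sketch, not the fix actually applied to the data):

```r
# Split a fused value like 1213 into its two 2-digit episode numbers;
# values under 100 are assumed to be ordinary single-episode numbers.
split_concat <- function(x) {
  if (x >= 100) c(x %/% 100, x %% 100) else x
}

split_concat(1213)  # episodes 12 and 13
split_concat(1920)  # episodes 19 and 20 (the max seen in the summary)
split_concat(22)    # unchanged
```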

C. Viewers

simpsons$viewers <- as.numeric(simpsons$us_viewers_in_millions)
## Warning: NAs introduced by coercion
which(is.na(as.numeric(simpsons$viewers)))
## [1] 160 161 173
simpsons[c(160, 161, 173),c("us_viewers_in_millions", "viewers")]
## # A tibble: 3 × 2
##   us_viewers_in_millions viewers
##   <chr>                    <dbl>
## 1 N/A                         NA
## 2 N/A                         NA
## 3 N/A                         NA
simpsons_rm <- simpsons[-c(160, 161, 173),]
  • Simply recast with as.numeric() for further use
  • Missing data points, originally missed because of the type issue
    • is.na() didn’t detect them because the column was of class char (hard-coded "N/A" strings)
    • Data is missing from Wikipedia
      • Interesting observation – Wikipedia articles report viewership in both (presumed) individual viewers and viewing households
        • i.e. compare the episode/viewer table to the Nielsen rating and viewing households under each episode’s reception section
        • The ratio of viewers to viewing households in the existing data came out to about (1.7–1.8):1 (for the case of season 8)
    • Season: 8, Episode(s): 7, 8, 20
  • Multiple ways to treat missing data for forecasting
    • Will try all three, starting with deleting the missing rows
    • The mean calculations noted earlier can be used for method 2 (mean substitution)
    • Imputation (method 3) will use other packages
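A minimal sketch of the first two approaches on a toy vector (the NAs stand in for the three missing viewer values); imputation via dedicated packages is left out:

```r
# Toy viewer counts with NAs mirroring the three missing "N/A" rows.
v <- c(14.2, NA, 9.5, NA, 20.1)

# Method 1: delete rows with missing values.
v_deleted <- v[!is.na(v)]

# Method 2: substitute the mean of the observed values.
v_mean_sub <- ifelse(is.na(v), mean(v, na.rm = TRUE), v)
```

Mean substitution keeps the sample size intact but shrinks the variance, which matters for forecasting, hence trying all three methods.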

(Match id to 607, 679, 709)


Graphs

Note: these graphs are expected to fall under time series analysis, and id/original_air_date/age are somewhat interchangeable as the time axis since they are effectively ordered/‘categorical.’

Seasonal

Ratings

#rating_v_season <- ggplot(data = tempSimps, aes(x = season, y = rating, color = season))

temp <- ggplot(data = tempSimps, aes(x = season, y = rating, color=season)) + 
  geom_boxplot() + ggtitle("IMDB Rating vs Simpsons Seasons") + xlab("Season") + ylab("Rating") + 
  scale_y_continuous(name="IMDB Rating",limits=c(4, 10))

#theme(legend.position = "none") +
ggplotly(temp)
sznAvg <- tempSimps %>% group_by(season) %>% summarise(sznAvg = mean(rating))

temp2 <- ggplot(data = sznAvg, aes(x = season, y = sznAvg)) + geom_point() + 
  ggtitle("Average IMDB Rating vs Simpsons Season") + xlab("Season") + ylab("Average IMDB Rating")

ggplotly(temp2)
  • The outliers here are more interesting to look at when compared to viewers
  • Ratings seem to plateau as seasons progress, but the occasional outlier does exist
    • This doesn’t necessarily mimic the viewership count
      • Note: viewership could also decrease with the growth of the streaming bubble – people are not pressed to watch shows live / on premiere unless they are devout fans / avid watchers
  • Seasons with large spread: 8, 23, and potentially 32
  • S8 has the highest maximum; conversely, S23 has the lowest
  • When looking solely at the averages, the declining pattern is more definite
  • Season 30 has a large drop-off from 29, then recovers in the following seasons
    • Comparable to the drop from season 8 to 9

Viewers

tempViewers <- ggplot(data = tempSimps, aes(x = season, y = as.numeric(viewers), color=season)) +
  geom_boxplot() + ggtitle("Viewers vs Simpsons Seasons") + xlab("Season") + ylab("Viewers (in millions)") +
  theme(legend.position = "none")

ggplotly(tempViewers)
tempJitter <- ggplot(tempSimps,aes(x=season, y= as.numeric(viewers), color=season)) +
      geom_boxplot() +
      geom_jitter(width=0.25, alpha=0.5)
ggplotly(tempJitter)
  • Viewership does appear to decrease over time
  • More outliers exist as maxima rather than minima
    • Two cases (Seasons 1 and 13) where the outliers are minima
    • Most extreme maximum is S16 E8 (id = 342)
      • Interesting to see that the rest of the maxima are not even close to the same viewership
      • The gap between the maximum and Q3 is the largest compared to the rest of the seasons
      • Why is it this specific episode that makes it an outlier?
  • https://en.wikipedia.org/wiki/Homer_and_Ned%27s_Hail_Mary_Pass
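For reference, the outliers drawn by geom_boxplot follow the standard 1.5*IQR convention; a sketch of the same rule as a reusable function (toy numbers, not the real viewer counts):

```r
# Flag boxplot-style outliers: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  unname(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

is_outlier(c(10, 11, 12, 11, 50))  # only the 50 is flagged
```

Applied per season (e.g. via group_by), this is how an episode like id 342 stands out within its own season rather than against the whole series.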

Episodic Data

Ratings

tempSimps2 <- tempSimps
tempSimps2$Season <- tempSimps2$season

tempPlot <- ggplot(data = tempSimps2, aes(x=id, y=rating, col=Season, shape=Season)) + geom_point() + 
  ylab("Rating") + xlab("Episode Number") + ggtitle("Rating vs Episode") +
  geom_line() + scale_shape_manual(values = rep(0:14, 3)) + scale_y_continuous(name="IMDB Rating", limits=c(4,10))

ggplotly(tempPlot)
#ggplot(data = simpsons, aes(x=id, y = rating)) + geom_point()
#ggplot(data = simpsons, aes(x=original_air_date, y = rating)) + geom_point()
  • General negative trend, as expected per the seasonal graph(s)
    • No notable start-of-season / end-of-season patterns yet, granted it is somewhat hard to observe
      • More detailed graph(s) later
    • i.e. season premiere vs. finale (presumably because of the show’s episodic nature and lack of an overarching plotline)
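The negative trend can be quantified with a simple linear fit of rating on episode number; a sketch on toy data (not the real series):

```r
# Toy ratings declining over 10 episodes.
toy <- data.frame(
  id     = 1:10,
  rating = c(8.2, 8.0, 8.1, 7.8, 7.5, 7.4, 7.0, 6.9, 6.8, 6.5)
)

# Slope of rating ~ id: negative means declining ratings.
fit <- lm(rating ~ id, data = toy)
coef(fit)[["id"]]
```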

Viewers

tempPlot <- ggplot(data=tempSimps2, aes(x=id, y=viewers, col=Season, shape=Season)) + geom_point() + 
  ylab("Viewers") + xlab("Episode Number") + ggtitle("Viewers vs Episode") + geom_line() + 
  scale_shape_manual(values = rep(0:14, 3))

#theme(legend.position = "none")

ggplotly(tempPlot)
#ggplot(data = simpsons, aes(x=id, y = rating)) + geom_point()
#ggplot(data = simpsons, aes(x=original_air_date, y = rating)) + geom_point()
  • Can see more detailed info when zoomed in accordingly
    • However, the graph is a little tight to get a more concise view
  • The overall trend shows that viewership declines and plateaus (as expected from the seasonal boxplots)

Season Specific

             Set 1 (1 - 12)      Set 2 (13 - 23)   Set 3 (24 - 33)
  Increase   10                  N/A               28
  Neutral    1, 4, 6, 11, 12     16                27
  Decrease   2, 3, 5, 7, 8, 9    13-15, 17-23      24-26, 29-33
  • Just rough estimations/observations, but note that the overall trend is generally a decrease in score
  • Not surprising, as noted earlier (not a series with an overarching season plot)
  • S1 doesn’t have a THOH episode
  • THOH episodes fall within the first 5 of the season, in some cases as the first episode of the season
  • In general, Treehouse of Horror (THOH) episodes seem to be in the upper half of viewership per season
  • Cases where the THOH episode is the highest of the season:
    • 5, 6, 15 (?), 17, 19, 20, 32, 33
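One way to pull out the THOH episodes for this kind of per-season comparison is by title prefix (a sketch on toy rows; it assumes every THOH title starts with that exact phrase):

```r
# Toy episode table; filter rows whose title starts with the THOH phrase.
toy <- data.frame(
  title  = c("Treehouse of Horror V", "Bart Gets an F", "Treehouse of Horror XXXII"),
  season = c(6, 2, 33)
)
thoh <- toy[grepl("^Treehouse of Horror", toy$title), ]
nrow(thoh)  # 2
```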

Error in season 33?

s33 <- get_season(33)
  #tempGG <- ggplot(temp, aes(x = id, y = viewers)) + geom_point() +
  #  ggtitle(glue('Season {i}')) + xlab("Episode") + ylab("Viewers")
highlight_g <- data.frame(s33[1,])
highlight_g[2, ] <- tail(s33, n=1)
  
tempGG <- ggplot(s33, aes(x = id, y = viewers)) + geom_point() +
  ggtitle("Season 33") + xlab("Episode") + ylab("Viewers") +
  geom_point(data = highlight_g, aes(x = id, y = viewers), col = "red", size = 3)

ggplotly(tempGG)
  • The error is in how the two-parter episode was inserted; fixed by sorting by episode id
  • While the fix isn’t important to the original intent of comparing the first vs. last episode, it’s still better for consistency
s33_fixed <- arrange(s33, id)
highlight_g <- data.frame(s33_fixed[1,])
highlight_g[2, ] <- tail(s33_fixed, n=1)
  
tempGG <- ggplot(s33_fixed, aes(x = id, y = viewers)) + geom_point() +
  ggtitle("Season 33") + xlab("Episode") + ylab("Viewers") +
  geom_point(data = highlight_g, aes(x = id, y = viewers), col = "red", size = 3)

ggplotly(tempGG)

oldViewers <- ggplot(data = old_simps, 
                     aes(x = id, y = us_viewers_in_millions, col = as.factor(season))) + 
  geom_point() + ylab("Viewers") + xlab("Episode Number") +
  ggtitle("Num Viewers vs Episode (Old)") + 
  theme(legend.position = "none") + 
  scale_y_discrete(breaks=seq(0, 12, 2))

ggplotly(oldViewers)
  • Changing the predictor variable to viewer count is interesting, as the pattern does not match the way I thought it would
    • i.e. lower ratings = fewer viewers, but need to reflect on how sample size would affect ratings
    • Viewer count is in millions
    • Maybe an on/off season schedule? (Spring/Summer vs. Fall/Winter)
    • NOTE: Some data is missing, as some us_viewers_in_millions values are hard-coded as "N/A" (char class)
      • 159, 160, 172
  • Old graph where data was not cleaned/recast; note the peculiar pattern (above)
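The peculiar pattern in the old graph comes from plotting a character column: a discrete axis orders values lexicographically rather than numerically, which scrambles the apparent trend. A small demonstration:

```r
# Character values sort by string comparison, so "9.5" lands after "13.0".
vals <- c("11.2", "9.5", "13.0")
sort(vals)              # "11.2" "13.0" "9.5"
sort(as.numeric(vals))  # 9.5 11.2 13.0  -- the intended numeric order
```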
#simpsons$viewers <- as.numeric(simpsons$viewers)
#(refactoredViewer <- ggplot(data = simpsons, aes(x = id, y = viewers, col = as.factor(season))) + 
#  geom_point() + ylab("Viewers") + xlab("Episode Number") + ggtitle("Num Viewers vs Episode") +
#  scale_y_continuous(name="Viewers (in mil)", limits=c(0,30)))


#ggplot(data = tempSimps, aes(x = id, y = viewers, col = as.factor(season))) + geom_point()
  • Upon recasting viewer count to numeric, the trend does indeed follow expectations (as noted from prior graphs)

  • THOH 31 is the lowest, and is low even within its immediate range / season

Writers / Directors

#length(unique(simpsons$writers))
#length(unique(simpsons$directors))

#mean(nchar(simpsons$writers))
avgWriter <- simpsons[which(nchar(simpsons$writers) > 16.6069 ),]

#ggplot(simpsons, aes(writers, fill = writers)) + geom_bar() +
  #coord_polar("y", start = 0)
  • 187 unique writer records (some entries contain multiple people, i.e. the outliers mentioned above – two-part episodes)
    • The way the data was obtained caused errors/inconsistencies
      • Scraped using different delimiters (&, spaces, or no space at all)
      • Some entries scraped including “Story by:” and/or “Teleplay by:”
        • Different semantics based off the WGA (how did the WGA strike of ’07 affect quality?)
          • Treating both as writers when looking for unique names
        • e.g. id == 176
  • 57 unique director listings
writers <- data.frame("writers" = tempSimps2$writers, "length" = nchar(tempSimps2$writers))
#long_writer <- writers %>% filter(length > mean(length))  # first pass: above-average length
long_writer <- writers %>% filter(length >= 20)  # manual cutoff after inspecting names
short_writer <- writers %>% filter(length <= 19)

long_writer$writers <- str_remove_all(long_writer$writers, "Story by:")
#long_writer <- str_remove_all(long_writer$writers, "Story by:")
  • First pass: comparing nchar to the average length of the writers field
    • Upon further inspection, some single-writer names are longer than average; a manual search showed 20 is a good cutoff
      • One instance of two writers where the length is 19 because there is no space between names
  • Otherwise, 123 instances of episodes with multiple writers
    • First pass: removing the “Teleplay by:” / “Story by:” labels
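A sketch of the full first cleaning pass on an invented example string (the "&" delimiter and the "Story by:" / "Teleplay by:" labels are assumed from the inconsistencies noted above):

```r
# Strip the role labels, then split the multi-writer entry on "&".
raw <- "Story by:John Swartzwelder&Teleplay by:Jon Vitti"
cleaned <- gsub("Story by:|Teleplay by:", "", raw)
writers <- trimws(strsplit(cleaned, "&", fixed = TRUE)[[1]])
writers  # "John Swartzwelder" "Jon Vitti"
```

Entries with space or no-space delimiters would need their own handling, which is why the pass above is only a first one.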

Data Analysis

Predictive Model

Sentiment Analysis off Descriptions


Misc Resources